Solution: Establish a Big-Tent Culture of Quality
Let's learn what would be required for constructing a quality product.
Quality simply comprises testing for most software developers. But that’s only quality control and it’s only part of the story. The entire life cycle of software engineering involves quality assurance, which includes three parts:
- Specifying project requirements clearly and in writing.
- Designing and developing a solution for our requirements.
- Validating and testing that our solution matches the requirements.
We need to do all three of these to perform QA correctly, although, in some software methodologies, we don’t necessarily have to do them in this order.
We can achieve quality assurance in database development by following best practices in documentation, source code control, and testing.
Exhibit A: Documentation#
There’s no such thing as self-documenting code. Although a skilled programmer can indeed decipher most code through a combination of careful analysis and experimentation, this is laborious. Also, code can’t tell us about missing features or unsolved problems.
We should document the requirements and implementation of a database just as we do with application code. Whether we’re the original designer of the database or we’re inheriting a database designed by someone else, we use the following checklist to document a database:
Entity-relationship diagram
The single most crucial piece of documentation for a database is an ER diagram showing the tables and their relationships.
Several chapters in this course use a simple form of ER diagrams. More complex ER diagrams have a notation for columns, keys, indexes, and other database objects.
Some diagramming software packages include elements for ER diagram notation. Some tools can even reverse-engineer an SQL script or a live database and produce an ER diagram.
One caveat is that databases can be complex and have so many tables that it’s impractical to use a single diagram. In this case, we should decompose it into several diagrams. Usually, we can choose natural subgroups of tables so that each diagram is readable enough to be helpful and not overwhelming for the reader.
Tables, columns, and views
We also need written documentation for our database because an ER diagram isn’t the suitable format to describe the purpose and usage of each table, column, and other objects.
Tables need a description of what type of entity the table models. For example, Bugs
, Products
, and Accounts
are pretty clear, but what about a lookup table like BugStatus
, an intersection table like BugsProducts
, or a dependent table like Comments
? Also, how many rows do we anticipate each table to have? What queries against this table are we expecting? What indexes exist in this table?
Every column has a name and a data type, but that doesn’t tell the reader what the column’s values mean. What values make sense in that column (it’s rarely the full range of the data type)? For columns storing a quantitative value, what is the unit of measurement? Does the column allow nulls or not, and why? Does it have a unique constraint, and if so, why?
Views store frequently used queries against one or more tables. What made it worthwhile to create a given view? What application or user is expected to use the view? Was the view intended to abstract a complex relationship of tables? Does it exist as a way to allow unprivileged users to query a subset of rows or columns in a privileged table? Is the view updatable?
Relationships
Referential integrity constraints are used to implement dependencies between tables, but this may create implied relationships within the model. For example, Bugs.reported_by
is not nullable, but Bugs.assigned_to
is nullable. Does that mean a bug can be fixed before it’s assigned? If not, what are the business rules for when the bug must be assigned?
In some cases, we may have implicit relationships but no constraints for them. Without documentation, it’s hard to know where these relationships exist.
Triggers
Data validation, data transformation, and logging database changes are examples of tasks for a trigger. What business rules are we implementing in triggers?
Stored procedures
We document our stored procedures like an API. What problem is the procedure solving? Does a procedure perform any changes to data? What are the data types and meanings of the input and output parameters? Do we intend the procedure to replace a certain type of query to eliminate a performance bottleneck? Do we use the procedure to grant unprivileged users access to privileged tables?
SQL security
What database users do we define for our applications to use? What access privileges does each of these users have? What SQL roles do we provide, and which users can use them? Are any users designated for specific tasks, such as backups or reports? What system-level security provisions do we use? Perhaps a client needs to reach the RDBMS server with SSL. What measures do we take to detect and block attempts at illicit authentication, like brute-force password guessing? Have we done a thorough code review for SQL Injection vulnerabilities?
Database infrastructure
TThis information is chiefly used by IT staff and DBAs, but developers need to know some of it too:
- What RDBMS brand and version do we operate?
- What is our database server hostname?
- Do we use multiple database servers, replication, clusters, proxies, and so on?
- What is our network organization, and what is the port number used by the database server?
- What connection options do client applications need to use?
- What are the database user passwords?
- What are our database backup policies?
Object-relational mapping
Our project may implement some database-handling logic in application code as part of a layer of ORM-based code classes. What are the business rules that are implemented in this way for each of data validation, data transformation, logging, caching, or profiling?
Developers don’t like to maintain engineering documentation. It’s hard to write, it’s hard to keep up-to-date, and it’s disappointing if few people read what you wrote. But even battle-hardened, extreme programmers know that they need to document the database, even if they document no other part of their software. For example, Jeff Atwood and Joel Spolsky cite seeing little value in documenting code, except for the database, in StackOverflow podcast # 80.
Trail of evidence: Source code control#
If our database server failed utterly, how would we re-create a database? What’s the best way to track a complex upgrade to our database design? How would we reverse a change?
We know how we would use a source control system to manage application code, solving similar problems of software development. A project under source control should include everything we need to rebuild and redeploy the project if our existing deployment explodes. Source control also serves as a history of changes and an incremental backup so we can reverse any of these changes.
We can use source control with our database code and get similar benefits for development.
We should check the files related to our database development in the source control, including the following:
Data definition scripts
All brands of the database provide ways to execute SQL scripts containing CREATE TABLE
and other statements that define the database objects.
Triggers and procedures
Many projects supplement application code with routines stored in the database. Our application probably won’t work without these routines, so they count as part of our project’s code.
Bootstrap data
Lookup tables may contain some set of data that represents an initial state of our database before any user enters new data. We should keep bootstrap data to consult if we need to re-create a database from our project source. Bootstrap data is also called seed
data.
ER diagrams and documentation
These files are not codes, but they’re closely tied to the code, describing database requirements, implementation, and integration with the application. As the project evolution results in changes to both the database and the application, we should keep these files up-to-date and make sure that the documents describe the current designs.
DBA scripts
Most projects have a collection of data-handling jobs that run outside the application. These include import/export, synchronization, reporting, backups, validation, testing, and so on. These may be written as SQL scripts, and not part of a conventional application programming language.
We must make sure that our database code files are associated with the application code that uses that database. Part of the benefit of using source control is that the files should work together if we check our project from the source control, given a certain revision number, date, or milestone. We must use the same source control repository for both application code and database code.
The burden of proof: Testing#
The final part of quality assurance is quality control — validating that our application does what it set out to do.
Most professional developers are familiar with techniques to write automated tests to validate application code behavior. A critical principle of testing is isolation — testing only one part of the system at a time so that if a defect exists, we can narrow down its location as precisely as possible.
We can extend the practice of isolation testing to the database by validating the database structure and behavior independently from our application code.
The following example shows a unit test script using the PHPUnit test framework:
<?php
require_once "PHPUnit/Framework/TestCase.php";
class DatabaseTest extends PHPUnit_Framework_TestCase
{
protected $pdo;
public function setUp()
{
$this->pdo = new PDO("mysql:dbname=bugs", "testuser", "xxxxxx");
}
public function testTableFooExists()
{
$stmt = $this->pdo->query("SELECT COUNT(*) FROM Bugs");
$err = $this->pdo->errorInfo();
$this->assertType("object", $stmt, $err[2]);
$this->assertEquals("PDOStatement", get_class($stmt));
}
public function testTableFooColumnBugIdExists()
{
$stmt = $this->pdo->query("SELECT COUNT(bug_id) FROM Bugs");
$err = $this->pdo->errorInfo();
$this->assertType("object", $stmt, $err[2]);
$this->assertEquals("PDOStatement", get_class($stmt));
}
static public function main()
{
$suite = new PHPUnit_Framework_TestSuite(__CLASS__);
$result = PHPUnit_TextUI_TestRunner::run($suite);
}
}
DatabaseTest::main();
?>
We can use the following checklist for tests that validate our database:
Tables, columns, views
We should test to see whether the tables and views that we expect are there in our database actually exist. Each time we enhance the database with a new table, view, or column, we add a new test that confirms that the object is present. We can also use negative tests to verify that a table or column we removed in the current revision of our project is, in fact, no longer present.
Constraints
This is another use of negative testing. Try to execute INSERT
, UPDATE
, or DELETE
statements that should result in an error because of a constraint. For example, try to violate non-null
, unique constraints, or foreign keys. If the statement doesn’t return an error, then our constraint isn’t working. We can catch many bugs early by identifying these failures.
Triggers
Triggers can enforce constraints too. Triggers can perform cascading effects, transform values, log changes, and so on. We should test these scenarios by executing a statement that spawns the trigger and then querying to confirm that the trigger performed the action we intended.
Stored procedures
Testing procedures in the database are closest to conventional unit testing of application code. A stored procedure has input parameters, and would throw errors if we try to pass its values outside the valid input range. Logic within the body of the procedure could allow multiple execution paths. The procedure could return a single value or a query result set, depending on the inputs and the state of data in the database. Also, the procedure could have “side effects” in the form of updates within the database. It’s a good idea to test all of these features of procedures to get a better understanding of how they operate.
Bootstrap data
Even a supposedly empty database typically needs some initial data, such as in lookup tables. We can run queries to validate that the initial data is present.
Queries
Application code is laced with SQL queries. We can execute queries in a test environment to validate syntax and results, such as to confirm that the result set includes the column names and data types we expect, just like testing tables and views.
ORM classes
Like triggers, ORM classes contain logic, including validation, transformation, and monitoring. We should test our ORM-based database abstraction code as we would any other application code. It is a good practice to confirm that these classes do the expected actions with input and also that they reject invalid input.
Schema Evolution Tools
If any of our tests fail, it could be because our application is using the wrong database instance. Thus, we must always double-check that we’re connecting to the right database. In case we are connecting to the wrong instance — which happens frequently enough — we must edit the configuration and try again. If, on the other hand, we’re sure that our connection is proper, but we need to alter the database, we can run a “migration script” (see Schema Evolution Tools above) to synchronize this database instance to match what our application expects.
Caseload: Working in multiple branches#
While we develop our application, we could work on multiple revisions of the code. We might even work on different revisions on the same day. For example, we could fix an urgent bug in the branch of the application currently deployed and then, moments later, resume working on long-term development in the main branch.
But the database our application uses isn’t under revision control. It’s not practical to set up and tear down a database at a moment’s notice, even if the database brand we use is relatively agile and easy to use.
Ideally, we should create a separate instance of our database for each revision of the application we need to develop, test, stage, or deploy. Also, each developer in our project team needs a separate database instance so they can work without interfering with the rest of the team.
We can make our application support a configurable means to specify database connection parameters so that whichever application revision we work on, we can specify which database to use without overwriting code.
Today, every RDBMS brand, both commercial and open-source, offers a free solution for development and testing. Platform virtualization technologies, such as VMware Workstation, Xen, and VirtualBox, allow every developer to run a clone of the server infrastructure at little cost. There is no reason that software developers can’t develop and test in a fully functional environment that matches the production environment.